PCA: Income Data with Python

Using sklearn

1 Introduction

This document demonstrates how to perform Principal Component Analysis (PCA) in Python using the scikit-learn library. PCA is a dimensionality reduction technique that transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components, ordered so that the first few components capture as much of the variance in the data as possible. We will use adult_income_dataset.csv for this demonstration.
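
To make the mechanics concrete, here is a minimal sketch of what PCA computes under the hood, using NumPy on a small synthetic matrix rather than the income data: centre the columns, form the covariance matrix, and take its eigenvectors as the principal directions.

Code
import numpy as np

# Small synthetic matrix: 6 observations, 3 features, with features 1 and 3 correlated
rng = np.random.default_rng(42)
X = rng.normal(size=(6, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=6)

# Centre the columns and form the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal directions
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]

print("Explained variance ratio:", eigenvalues[order] / eigenvalues.sum())
print("Projected data:\n", X_centered @ eigenvectors[:, order])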

2 Load Data

First, we load the necessary libraries and the income dataset.

Code
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

# Load the income dataset
income_df = pd.read_csv("../data/adult_income_dataset.csv")
Next, we drop the target column, remove rows with missing values, take a random sample, one-hot encode the categorical features, and standardize the numerical ones.

Code
# Drop the target column, remove rows with missing values, and take a 1,000-row sample
income_df_clean = (
    income_df.drop('income', axis=1)
    .dropna()
    .sample(n=1000, random_state=42)
)

# Separate numerical and categorical columns
numerical_cols = income_df_clean.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = income_df_clean.select_dtypes(include=['object']).columns

# One-hot encode categorical features
income_df_encoded = pd.get_dummies(income_df_clean, columns=categorical_cols, drop_first=True)

# Standardize numerical features
scaler = StandardScaler()
income_df_encoded[numerical_cols] = scaler.fit_transform(income_df_encoded[numerical_cols])

# Final feature matrix: scaled numerical columns plus dummy indicator columns
scaled_income = income_df_encoded.values.astype(float)
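
As a quick sanity check (assuming the preprocessing cell above has been run), we can confirm the shape of the feature matrix and verify that the standardized numerical columns now have approximately zero mean and unit variance.

Code
# Shape of the final feature matrix
print("Feature matrix shape:", scaled_income.shape)

# Standardized numerical columns should have ~0 mean and ~1 standard deviation
print(income_df_encoded[numerical_cols].mean().round(3))
print(income_df_encoded[numerical_cols].std().round(3))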

3 Principal Component Analysis

We first fit a PCA with all components on the preprocessed income data and examine how much variance each principal component explains.

Code
pca = PCA()
pca.fit(scaled_income)

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Scree plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, marker='o')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.grid(True)
plt.show()

# Cumulative explained variance
cumulative_explained_variance = explained_variance_ratio.cumsum()
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o')
plt.title('Cumulative Explained Variance')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()
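
One common way to read these plots is to keep the smallest number of components that reaches a target fraction of explained variance. A minimal sketch, assuming an (arbitrary) 90% target:

Code
import numpy as np

# Smallest number of components reaching 90% cumulative explained variance
target = 0.90
n_components_90 = np.argmax(cumulative_explained_variance >= target) + 1
print(f"Components needed for {target:.0%} of the variance: {n_components_90}")

# Passing a float in (0, 1) to PCA keeps enough components to reach that fraction
pca_90 = PCA(n_components=target)
pca_90.fit(scaled_income)
print("Components kept by PCA(n_components=0.90):", pca_90.n_components_)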

4 PCA Results Visualization

We can visualize the data projected onto the first two principal components.

Code
pca_2d = PCA(n_components=2)
principal_components_2d = pca_2d.fit_transform(scaled_income)

principal_df_2d = pd.DataFrame(
    data=principal_components_2d,
    columns=['principal component 1', 'principal component 2'],
)

plt.figure(figsize=(10, 6))
sns.scatterplot(x='principal component 1', y='principal component 2',
                data=principal_df_2d, s=100)
plt.title('PCA of Income Data (First Two Components)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
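
The scatter plot is more informative when points are coloured by the income label that was set aside during preprocessing. A minimal sketch, assuming the row indices of income_df_clean still align with the original income_df (dropna and sample preserve the index):

Code
# Recover the income label for the sampled rows and attach it to the projection
income_labels = income_df.loc[income_df_clean.index, 'income'].reset_index(drop=True)
principal_df_2d['income'] = income_labels

plt.figure(figsize=(10, 6))
sns.scatterplot(x='principal component 1', y='principal component 2',
                hue='income', data=principal_df_2d, alpha=0.6)
plt.title('PCA of Income Data Coloured by Income Label')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()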

5 Conclusion

This document provided an overview of Principal Component Analysis in Python using scikit-learn. We demonstrated how to preprocess the income dataset, perform PCA, and visualize the results. The final section below shows how to keep a fixed number of components for further use.

6 Keeping Top 5 Components

We can select and work with a reduced number of principal components, for example the top 5 components, which together explain roughly 58% of the variance in this sample.

Code
pca_5 = PCA(n_components=5)
principal_components_5 = pca_5.fit_transform(scaled_income)

print("Shape of data after keeping top 5 components:", principal_components_5.shape)
print("First 5 rows of the top 5 principal components:\n", principal_components_5[:5])

# Explained variance by the top 5 components
print("Explained variance ratio of top 5 components:", pca_5.explained_variance_ratio_)
print("Cumulative explained variance of top 5 components:", pca_5.explained_variance_ratio_.cumsum()[-1])
Shape of data after keeping top 5 components: (1000, 5)
First 5 rows of the top 5 principal components:
 [[-0.66259976 -0.6901712  -0.12605474 -0.50941084  0.03776418]
 [ 0.20225145  0.96781646 -0.85529215 -0.86061386  0.21421535]
 [ 1.04905925 -1.10717674  0.30121431 -0.15852958  0.92758685]
 [ 0.05573208 -1.37283113  0.10129154 -0.58123139 -0.12323307]
 [ 0.11670067 -0.64810635  0.06004221  0.06858625  0.82368069]]
Explained variance ratio of top 5 components: [0.15732576 0.11650394 0.10643612 0.10423658 0.09096257]
Cumulative explained variance of top 5 components: 0.575464968553261
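
To interpret what the retained components represent, we can inspect their loadings, that is, the weight each encoded feature contributes to each component. A minimal sketch, assuming the column names of income_df_encoded:

Code
# Loadings: rows are components, columns are the encoded input features
loadings = pd.DataFrame(
    pca_5.components_,
    columns=income_df_encoded.columns,
    index=[f"PC{i + 1}" for i in range(5)],
)

# Largest-magnitude loadings on the first principal component
print(loadings.loc["PC1"].abs().sort_values(ascending=False).head(10))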